Necessary to decide which variables to use in model
“d” stands for “directional”
Usually we are dealing with more than two variables
Complication: causation flows only in the direction of the arrows, but association can also flow against them
Code
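library(ggdag)      # dagify(), tidy_dagitty(), ggdag(), geom_dag_*(), theme_dag()
library(tidyverse)  # %>%, mutate(), ggplot2

# Pipe (x -> z -> y2), collider (x -> a <- y3) and fork (x <- d -> y1) in one graph
dagify(
  z ~ x, y2 ~ z, a ~ x, a ~ y3, x ~ d, y1 ~ d,
  coords = list(
    x = c(x = 1, z = 1.5, y2 = 2, a = 1.5, y3 = 2, d = 1.5, y1 = 2),
    y = c(x = 1, y2 = 1, z = 1, a = 0, y3 = 0, d = 2, y1 = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggdag(text_size = 3, node_size = 5) +
  geom_dag_edges() +
  theme_dag() +
  labs(
    title = "Causal Pitchfork",
    subtitle = "x -> z -> y2 is causal; x and y1 are d-connected only via d; x and y3 are d-separated"
  ) +
  theme(title = element_text(size = 8))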
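The d-connection statements for this graph can be checked mechanically. The following is a minimal sketch, assuming the dagitty package is installed; the object name `pitchfork` is made up for illustration, and the graph is the one plotted above rewritten in dagitty's text syntax.

```r
library(dagitty)

# Same edges as the "Causal Pitchfork" figure
pitchfork <- dagitty("dag {
  x -> z -> y2
  x -> a <- y3
  x <- d -> y1
}")

dconnected(pitchfork, "x", "y2")       # TRUE: causal path x -> z -> y2
dconnected(pitchfork, "x", "y1")       # TRUE: non-causal path through the fork at d
dseparated(pitchfork, "x", "y1", "d")  # TRUE: conditioning on d blocks the fork
dseparated(pitchfork, "x", "y3")       # TRUE: the collider a blocks the path
dconnected(pitchfork, "x", "y3", "a")  # TRUE: conditioning on a opens the path
```

The optional last argument is the conditioning set, so the same two functions cover all three building blocks (fork, pipe, collider).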
Analyzing DAGs: Fork
Good Control
Code
# Fork: d is a common cause of x and y1
med <- dagify(
  x ~ d, y1 ~ d,
  coords = list(
    x = c(x = 1, z = 1.5, y = 2, a = 1.5, b = 2, d = 1.5, y1 = 2),
    y = c(x = 1, y = 1, z = 1, a = 0, b = 0, d = 2, y1 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name == "d", "Confounder", "variables of interest")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
med
d causes both x and y1
Paths that start with an arrow pointing into x are called “back-door” paths
Eliminated by a randomized experiment! Why?
Controlling for d “blocks” the non-causal path x \(\leftarrow\) d \(\rightarrow\) y1
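A small simulation can make the fork concrete. This is only a sketch under assumed values: the linear model, the coefficients of 1, the sample size, and the names `naive` and `adjusted` are all illustrative choices, not part of the original slides. Here x has no causal effect on y1 at all.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
n  <- 1e5
d  <- rnorm(n)       # confounder
x  <- d + rnorm(n)   # d -> x
y1 <- d + rnorm(n)   # d -> y1; x does not enter the equation for y1

# Slope on x without and with the confounder in the model
naive    <- unname(coef(lm(y1 ~ x))["x"])      # back-door path x <- d -> y1 is open
adjusted <- unname(coef(lm(y1 ~ x + d))["x"])  # back-door path blocked by d

round(c(naive = naive, adjusted = adjusted), 2)
```

Under these assumptions the naive slope should land near 0.5 even though the true effect is zero, while the adjusted slope should be close to 0.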
Analyzing DAGs: Pipe
Bad Control for the total effect of x (possibly use mediation analysis)
Code
# Pipe: z mediates the effect of x on y2
med <- dagify(
  z ~ x, y2 ~ z,
  coords = list(
    x = c(x = 1, z = 1.5, y2 = 2),
    y = c(x = 1, y2 = 1, z = 1)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name == "z", "Mediator", "variables of interest")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
med
x causes y2 through z
Controlling for z blocks the causal path x \(\rightarrow\) z \(\rightarrow\) y2
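The same kind of simulation shows what goes wrong in the pipe. Again a sketch under assumed values (linear effects of size 1, illustrative names `total` and `controlled`): the true total effect of x on y2 is 1, and it runs entirely through z.

```r
set.seed(2)          # arbitrary seed, for reproducibility only
n  <- 1e5
x  <- rnorm(n)
z  <- x + rnorm(n)   # x -> z
y2 <- z + rnorm(n)   # z -> y2; no direct arrow from x to y2

# Slope on x without and with the mediator in the model
total      <- unname(coef(lm(y2 ~ x))["x"])      # causal path left open
controlled <- unname(coef(lm(y2 ~ x + z))["x"])  # causal path blocked by z

round(c(total = total, controlled = controlled), 2)
```

With these assumptions the first slope should be near 1 and the second near 0: including the mediator removes exactly the effect one wanted to estimate.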
Analyzing DAGs: Collider
Bad control
Code
# Collider: x and y are independent causes of a
dagify(
  a ~ x, a ~ y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5),
    y = c(x = 1, y = 0, a = 0)
  )
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "a", "Collider", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
x & y cause a
x & y are d-separated and uncorrelated
Adding a to the model introduces a spurious correlation between x & y
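The collider case can be simulated in the same way. This sketch assumes independent standard-normal x and y and a collider a = x + y + noise; the names `without_a` and `with_a` are illustrative.

```r
set.seed(3)              # arbitrary seed, for reproducibility only
n <- 1e5
x <- rnorm(n)
y <- rnorm(n)            # independent of x by construction
a <- x + y + rnorm(n)    # x -> a <- y

# Slope on x without and with the collider in the model
without_a <- unname(coef(lm(y ~ x))["x"])      # path blocked at a
with_a    <- unname(coef(lm(y ~ x + a))["x"])  # conditioning on a opens the path

round(c(without_a = without_a, with_a = with_a), 2)
```

Under these assumptions the slope without a should be near 0, while the slope with a should be near -0.5, an association that exists only because a was added.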
Exercise
Which variables should be included?
Effect of x on y
Effect of z on y
Code
library(ggdag)
library(dagitty)
library(tidyverse)

# Exercise DAG with x as exposure and y as outcome
dagify(
  y ~ n + z + b + c,
  x ~ z + a + c,
  n ~ x,
  z ~ a + b,
  exposure = "x",
  outcome = "y",
  coords = list(
    x = c(n = 2, x = 1, y = 3, a = 1, z = 2, c = 2, b = 3),
    y = c(x = 2, y = 2, a = 3, z = 3, c = 1, b = 3, n = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggdag(text_size = 8, node_size = 12) +
  geom_dag_edges() +
  theme_dag()
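After working through the exercise by hand, the answer can be checked with dagitty. This is a sketch, assuming the dagitty package is installed; `exercise_dag`, `sets_x` and `sets_z` are illustrative names, and the graph is the exercise DAG rewritten in dagitty's text syntax.

```r
library(dagitty)

# Same edges as the exercise DAG
exercise_dag <- dagitty("dag {
  x -> n -> y
  a -> x
  a -> z
  b -> z
  b -> y
  z -> x
  z -> y
  c -> x
  c -> y
}")

# Minimal sufficient adjustment sets for the total effect (the function's defaults)
sets_x <- adjustmentSets(exercise_dag, exposure = "x", outcome = "y")
sets_z <- adjustmentSets(exercise_dag, exposure = "z", outcome = "y")

print(sets_x)  # effect of x on y
print(sets_z)  # effect of z on y
```

Each printed set is one valid choice of control variables; when more than one set is returned, any single one of them is sufficient.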
library(ggpubr)  # ggarrange()

# M-bias: a is a collider between the unobserved U1 and U2
p1 <- dagify(
  y ~ x + U2, a ~ U1 + U2, x ~ U1,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U1 = 1, U2 = 2),
    y = c(x = 1, y = 1, a = 1.5, b = 0, U1 = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name %in% c("U1", "U2"), "Unobserved", "Observed")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "M-Bias")

# Post-treatment bias: a is caused by x and shares the unobserved cause U with y
p2 <- dagify(
  y ~ a + U, a ~ x + U,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U = 1.7, U2 = 2),
    y = c(x = 1, y = 1, a = 1, b = 0, U = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name %in% c("U"), "Unobserved", "Observed")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Post-treatment Bias")

ggarrange(p1, p2)
Common bad controls
Code
# Selection bias: a is caused by both x and y
p1 <- dagify(
  y ~ x, a ~ x + y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U1 = 1, U2 = 2),
    y = c(x = 1, y = 1, a = 1.5, b = 0, U1 = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Selection Bias")

# Case-control bias: a is a descendant of the outcome y
p2 <- dagify(
  y ~ x, a ~ y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U1 = 1, U2 = 2),
    y = c(x = 1, y = 1, a = 1.5, b = 0, U1 = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Case-control Bias")

ggarrange(p1, p2)
Intelligence, education, income
Case-control study: Observation ex-post. Ex.: Smoking \(\rightarrow\) lung cancer
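Case-control bias can also be illustrated with simulated data. The sketch below assumes the structure x \(\rightarrow\) y \(\rightarrow\) a from the figure, linear effects of size 1, and illustrative names `true_fit` and `biased_fit`; none of these numbers come from the slides.

```r
set.seed(4)          # arbitrary seed, for reproducibility only
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)    # x -> y with a true effect of 1
a <- y + rnorm(n)    # y -> a: a is a descendant of the outcome

# Slope on x without and with the outcome's descendant in the model
true_fit   <- unname(coef(lm(y ~ x))["x"])
biased_fit <- unname(coef(lm(y ~ x + a))["x"])

round(c(true_fit = true_fit, biased_fit = biased_fit), 2)
```

Under these assumptions the first slope should be near the true value of 1, while controlling for a should pull it down to roughly 0.5, because a carries part of the variation in y.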
Exercise
Prepare a short presentation of a (potential) DAG for your thesis
References
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2020. “A Crash Course in Good and Bad Controls.” SSRN 3689437.
Imbens, Guido W. 2020. “Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics.” Journal of Economic Literature 58 (4): 1129–79. https://doi.org/10.1257/jel.20191597.